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Abstract 

The  mixture  transition  distribution  (MTD)  model  was  introduced  by  Raftery  (1985) 
as  a  parsimonious  model  for  high-order  Markov  chains.  It  is  flexible,  can  represent  a 
wide  range  of  dependence  patterns,  can  be  physically  motivated,  fits  data  well,  and 
appears  to  be  a  discrete-valued  analogue  for  the  class  of  autoregressive  time  series 
models.  However,  estimation  has  presented  difficulties  because  the  parameter  space  is 
highly  nonconvex,  being  defined  by  a  large  number  of  nonlinear  constraints. 

Here  we  propose  an  efficient  computational  algorithm  for  maximum  likelihood  esti¬ 
mation  which  is  based  on  a  way  of  reducing  the  large  number  of  constraints.  This  also 
allows  more  structured  versions  of  the  model,  for  example  those  involving  structural 
zeros,  to  be  fit  quite  easily.  A  way  of  fitting  the  model  using  GLIM  is  also  discussed. 

The  algorithm  is  applied  to  a  sequence  of  wind  directions,  and  also  to  two  sequences 
of  DNA  bases  from  introns  from  genes  of  the  mouse.  In  each  case,  the  MTD  model  fits 
better  than  the  conventional  Markov  chain  model. 


1  Introduction 

There  are  many  examples  in  which  we  would  like  to  fit  high-order  Markovian  models  to  dis¬ 
crete  data.  However,  in  the  conventional  parametrisation  of  such  processes  the  number  of  pa¬ 
rameters  increases  geometrically  with  the  order,  so  that  parsimony  is  effectively  lost.  In  this 
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paper,  we  describe  some  computational  algorithms  for  fitting  a  parsimonious  autoregressive- 
like  Markov  model  known  as  the  Mixture  Transition  Distribution  (MTD)  model,  and  we 
illustrate  their  use  with  some  examples. 

The  MTD  model  was  introduced  by  Raftery  (1985a,b)  and  is  defined  as  follows.  Let 
{Wt  :  t  =  0, 1, . . .}  be  a  time  homogeneous  l  th  order  Markov  chain  on  a  finite  set  of  m  states 
(here  labelled  1,2,. . .  ,m),  and  let  the  transition  probabilities  be 

/»(*o  |  —  ,ii)  =  P{Xt+i  =  *o  |  Xt+i-j  =  i  u...,Xt  =  it),  t  =  0,1, -  (1) 

There  are  (m  —  1)  ml  independent  parameters  in  equation  (1).  Raftery’s  model  provides,  for 
/  >  1,  a  useful  parameter  reduction  in  (1)  by  supposing  that 

,*/)  =  X,  (2) 

i=i 

where  Q  =  {^(z  j  j ) }  is  a  column  stochastic  matrix  satisfying 

q(i\j)>0  and  q(r  |  j)  =  1,  j  =  1, . . .  ,m,  (3) 

r=l 

and 

Aj  +  •  •  •  4-  A/  =  1.  (4) 

Note  that  the  number  of  independent  parameters  is  now  m(m  —  1)  -f  /  —  1,  increasing 
only  linearly  in  l.  For  example,  when  there  are  m  =  4  states,  the  number  of  parameters  for 
a  second-order  ( l  =  2)  chain  is  13  in  the  MTD  model  as  against  48  in  the  usual  second-order 
Markov  chain  model.  To  ensure  that  the  transition  probabilities  are  properly  defined,  we 
also  require 

i 

X]  -\z  ?(*o  I  *;)  -  0  for  all  *0,  •  •  • ,  U-  (5) 

j=i 

Notice  that  when  (3),  (4)  and  (5)  are  satisfied,  p(io  |  ii, . .  •  ,  z/)  <  1  for  all  z0,  Zi, . . .  , z-/. 

The  MTD  model  is  so  called  because  the  conditional  probabilities  in  (2)  are  linear  combi¬ 
nations  of  contributions  from  the  past.  It  is  analogous  to  the  AR(/)  model  in  that  one  extra 
parameter  is  added  to  the  model  for  each  extra  lag  and  that  the  lagged  bivariate  distribu¬ 
tions  satisfy  a  system  of  matrix  equations  similar  to  the  Yule- Walker  equations.  In  some 
situations  it  has  a  direct  physical  interpretation  in  terms  of  the  probability  of  returning  to 
past  states,  or  r+atcs  close  to  them. 

We  have  found  the  MTD  model  useful  in  practice,  but  it  is  not  easily  fitted  because  of 
the  non-linear  nature  of  (2)  and  the  constraints  in  (5).  From  an  algorithmic  point  of  view, 
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there  are  really  two  special  cases  of  the  MTD  model,  these  being  determined  by  what  is 
assumed  (in  addition  to  (3)  and  (4))  about  the  {A;}.  With  the  positivity  assumption 


A,  >  0,  *  =  1, . . . ,  /,  (6) 

the  inequality  in  (5)  is  automatically  satisfied.  The  examples  given  in  Raftery  (1985a) 
satisfied  (6),  but  several  of  the  data  sets  we  have  analysed  do  not.  One  of  these  is  described 
later  in  the  paper.  Without  this  positivity  assumption,  equation  (5)  comes  into  play  in  a 
crucial  way,  and  computationally  it  becomes  very  important  to  be  able  to  reduce  the  large 
number  of  constraints  that  are  operating  there.  In  section  2  we  show  how  this  can  be  done 
and  in  section  3  we  give  several  examples. 

2  Parameter  Estimation 

2.1  Reducing  the  Number  of  Constraints 

We  saw  in  section  1  that  the  general  MTD  model  must  satisfy  the  ml  (m  —  1)  constraints 
in  (5).  For  example,  in  the  four-state  second-order  case,  i.e.  m  =  4  and  l  =  2,  the  number 
of  constraints  is  48,  so  that  the  resulting  constrained  numerical  optimization  problem  is 
computationally  demanding.  The  following  result,  which  reduces  the  effective  number  of 
constraints  in  (5)  to  m,  is  at  the  heart  of  our  fitting  algorithm: 

Proposition  1  Let  T  =  J2i:\,>o  ^ i>  ant 1  define  q„(i)  =  mini<;<m  q(i\j)  and  q+(i)  = 
maxi<j<m  q{i  j  j).  Then  £ '=1  A;  q(i  |  ij)  >  0  for  all  i, iu  . . . ,  tj  if  and  only  if 

T q~(i)  +  (1  -  T)  q+(i)  >  0  for  all  i.  (7) 

^  * 

Proof  If  (7)  holds,  then 

£j=i  A,  <?(*'  I  ij)  =  Ej:A;> 0  A>  q{i  \  ij)  +  £j:A,<0  A,  q(i  I  ij) 

^  E^>oAj9-(.)  +  Ej:Vo  Aj9+(0 

=  Tq-(i)  +  (l-T)q+(i) 

>  o.  *  ;  . 

Conversely,  assume  that  q~{i)  =  q{i  | po)  and  q+(i)  =  q{i  | pi ).  Then  ■  » 

T  q~(i)  +  (1  —  T)q+{i)  =  Ej;x,>o  Aj  q{i  \  Po)  ; 

+  Zy.\J<o  Aj?(i|pi)  -  ,  o..  .... 

>0.  .  .  , 

» » 

w\  , 
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2.2  Maximum  Likelihood  Estimation 

The  parameters  Q  and  {A,}  can  be  fitted  by  maximum  likelihood  by  maximising  the  log- 
likelihood 

log  I  =  •••,*/)  logp(i0  |*i,  — ,  */),  (8) 

where  n(i 0,  *j, . . . ,  i<)  is  the  number  of  times  the  sequence  ii  — ►  tj_i  i0  occurs  in  the 

data,  p(iQ  |  ij, . . . ,  it)  is  given  by  (2),  and  the  sum  is  over  ail  i0,  ii , . . . ,  it  with  n(i0,  i\, . . . ,  it)  > 
0.  The  maximisation  is  subject  to  the  constraints  (3),  (4)  and  (7). 

While  there  are  several  numerical  approaches  that  might  be  taken  to  this  problem,  we 
found  that  direct  maximisation  of  (8)  was  effective.  We  used  the  sequential  quadratic  pro¬ 
gramming  algorithm  implemented  as  E04UCF  in  Mark  13  of  NAG  (Numerical  Algorithms 
Group,  1988).  Although  derivatives  of  the  objective  function  and  the  constraint  functions 
may  be  calculated,  we  found  that  approximating  these  bv  finite  differences  was  effective. 

One  troublesome  part  of  the  algorithm  involves  the  storage  and  recovery  of  the  counts 
n(i0,  U  ,  •  •  • ,  */)•  For  models  with  high  values  of  /  or  m,  the  number  ml+l  of  potential  patterns 
can  be  extremely  large.  We  proceeded  by  labelling  a  pattern  i  =  i0,  »i, . . . ,  it  by  the  number 

j=0 

If  the  number  of  possible  patterns  was  sufficiently  small,  we  stored  the  whole  (now  one¬ 
dimensional)  array  of  counts.  On  the  other  hand,  the  maximum  number  N,  say,  of  patterns 
that  can  be  observed  in  the  data  is  a  little  less  than  the  length  of  the  observed  time  series, 
so  in  cases  where  m,+1  is  very  large  compared  to  IV,  we  calculated  and  stored  the  counts 
using  a  simple  hashing  algorithm  with  a  vector  of  length  approximately  N. 

The  types  of  data  we  have  analysed  have  led  to  several  useful  additional  features  of  the 
programs,  namely: 

1.  The  positive  A  model  satisfying  (6)  and  the  more  general  model  satisfying  (7)  may  be 
fitted  separately. 

2.  Structural  zeroes  in  the  Q  matrix  may  be  handled  directly.  Example  3.2  in  Raftery 
(1985a)  is  of  this  type. 

3.  Predetermined  A  values  may  also  be  set  to  zero,  corresponding  to  the  omission  of  those 
terms. 
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4.  Fixed  or  random  starts  for  the  parameters  in  the  iterative  scheme  are  allowed.  In  the 
first  instance,  Q  is  estimated  by  the  usual  first-order  transition  matrix,  and  the  A,  are 
equal.  In  the  second  instance,  random  Q  and  A^  are  used,  subject  to  (3)  and  (4).  This 
facility  is  particularly  useful  in  the  iterative  algorithm  for  determining  whether  a  local 
or  potentially  global  maximum  has  been  reached.  While  we  have  no  formal  proof  that 
a  unique  maximum  exists,  numerical  evidence  with  some  20  data  sets  suggests  that  it 
does.  A  more  formal  assessment  of  this  might  follow  Finch  et  al.  (1989). 

2.3  Minimum  x2  estimation 

As  an  alternative  to  maximum  likelihood,  we  have  also  used  minimum  x2  estimation.  The 
aim  is  to  find  Q  and  {A;}  that  minimise 

y-2  _  v-'  (n(ip, . . . ,  t;)  e(tp.  •  •  •  ,  if)) 

where 

e(*o,  •••,*/)  =  n(+,*i,...,i/)p(i0  hi,  ...,*/), 

and  +  denotes  summation  over  that  index.  The  sum  is  over  all  n(z0,  i1? . . . ,  ij)  for  which 
n(-(-,  tj ,  >  0,  and  the  constraints  in  (3),  (4)  and  (7)  apply.  This  is  a  useful  alternative, 

since  the  fitted  counts  e  from  the  optimisation  are  natural  candidates  aj  measures  of  goodness 
of  fit  of  the  model.  Kwok  (1988)  and  Li  and  Kwok  (1989)  have  shown  that  in  some  special 
cases  of  the  MTD  model,  the  minimum  x2  estimator  has  lower  bias  than  the  maximum 
likelihood  estimator  but  about  the  same  variance,  and  hence  lower  overall  mean  squared 
error. 

The  asymptotic  theory  of  X2  when  the  parameters  have  been  estimated  by  Maximum 
Likelihood  is  given  in  Billingsley  (1961),  where  it  is  shown  that  X 2  has  approximately  a 
X2  distribution  when  the  chain  really  is  of  order  l.  The  asymptotic  behavior  of  X2  in  the 
present  context  is  the  same. 

We  also  experimented  with  different  numerical  algorithms  for  this  problem,  essentially 
based  on  knowledge  of  derivatives  for  the  constraint  functions.  Once  more  the  direct  ap¬ 
proach  seems  the  easiest,  using  NAG  algorithm  E04UCF  again. 

Further  details  of  the  programs,  and  copies  of  the  code,  may  be  obtained  from  the  authors 
upon  request. 
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2.4  GLIM  analysis  of  two-state  models 

When  m  =  2,  the  MTD  model  may  be  fitted  using  an  iterative  procedure  in  GLIM  (Baker, 
1986).  Focus  for  the  moment  on  the  case  l  =  2,  and  write  A!  =  A,A2  =  1  —  A.  The 
log-likelihood  (8)  may  be  written 

logL  =  J2  fe  log  p(i  •  (9) 

'»=1  ' 

For  each  *2,  the  inner  term  in  (9)  is  (essentially)  a  binomial  log-likelihood  for  n(-f ,  ij,  i2) 
trials  and  success  probability  p(l  |  *11*2)1  where 


P(1  h'1,1'2)  = 


9(1 1 1)  *i  =  l,i2  =  1 

A9(l|l)  +  (1-A)9(l|2)  *i  =  1>  *2  =  2 
(1-  A)g(ljl)  +  A9(l|2)  =  2, i2  —  1 

5(1 1 2)  *1  =  2,  *2  =  2. 


(10) 


If  A  is  known,  then  (10)  shows  that  the  p(l  1  i‘i , *2)  are  linear  in  the  parameters  9(1  1 1) 
and  9(1 1  2),  9(1  1 1)  being  the  coefficient  of  the  covariate  xf  =  (1,  A,  1  —  A,  0)  and  9(1  |  2)  the 
coefficient  of  the  covariate  xj  =  (0, 1  —  A,  A,  1).  Thus  q(l  1 1)  and  9(1 1 2)  (and  so  Q )  may  be 
estimated  using  binomial  error,  identity  link,  no  intercept  and  covariates  Xi  and  x2. 

On  the  other  hand,  if  9(1 1 1)  and  9(1  ( 2)  are  assumed  known,  then  (10)  shows  that 
p(l  |  i 2 )  is  linear  in  A;  A  is  the  coefficient  of  the  covariate  X3  =  (0, 9(1  1 1)  — 9(1  1 2),  9(1  1 2)  — 
9(1 1 1),0)  and  the  offset  is  x^  =  (9(1 1 1),  9(1  (2),  9(1 1 1),  9(1  |2)).  Thus  A  may  be  estimated 
using  binomial  error,  identity  link,  no  intercept,  covariate  X3,  and  offset  x4.  This  leads  to  a 
simple  recursive  scheme  for  estimating  the  parameters,  reminiscent  of  the  iterative  algorithms 
used  in  survival  analysis;  cf.  Aitken  et  al.  (1989),  Chapter  6. 

The  generalisation  to  /  >  2  is  almost  immediate  from  the  form  of  (10).  The  number  of 
covariates  for  the  first  stage  remains  2,  the  elements  of  Xj  being  replaced  by  YLy.i,= 1  A j.  For 
the  second  stage  the  number  of  covariates  is  /  —  1 .  It  does  not,  however,  seem  to  be  simple 
to  generalise  this  scheme  to  the  case  m  >  2. 


2.5  Model  comparison 

In  order  to  compare  the  rival,  non-nested,  models  in  the  examples  that  follow,  we  would 
ideally  like  to  compute  the  posterior  probability  of  each  model  under  a  range  of  plausible 
prior  distributions  for  the  parameters.  The  use  of  successive  significance  tests  seems  less 
satisfactory  because  many  of  the  comparisons  involve  non-nested  models  and  because  the 
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use  of  multiple  tests  make  the  properties  of  the  overall  procedure  hard  to  assess.  We  do 
not  adopt  information  criteria  for  the  selection  of  a  single  model,  because  conditioning  on  a 
single  selected  model  ignores  model  uncertainty  and  so  overstates  our  knowledge. 

Here  we  use  the  approximate  result  that  if  we  are  comparing  two  models,  Mo  and  A/j, 
then  the  Bayes  factor,  or  ratio  of  posterior  to  prior  odds,  Bo i,  for  Mo  against  Mi  satisfies 

—  2  log  Bo\  =  BICq  —  B1C\.  (11) 

In  (11),  BICi  =  — 21ogZf-  -f  fcilogn,  where  X,  is  the  maximized  likelihood  and  b  is  the 
number  of  independent  parameters  in  the  model  M,(z  =  0, 1).  Although  this  has  not  been 
formally  proved  for  the  MTD  model,  it  has  been  established  for  independent  exponential 
family  observations  by  Schwarz  (1978),  for  the  usual  Markov  chain  by  Katz  (1981)  and  for 
log-linear  models  of  contingency  tables  by  Raftery  (1986a).  The  MTD  model  appears  to 
satisfy  regularity  conditions  that  would  permit  the  adaptation  to  it  of  proofs  for  other  cases. 
If  (11)  is  always  some  baseline  model  such  as  the  independence  model  and  (11)  is  calculated 
for  each  model  of  interest  M\,  then  the  resulting  Bayes  factors  readily  yield  the  posterior 
probability  of  each  of  the  models  of  interest  (Raftery,  1988).  The  rules  of  thumb  of  Jeffreys 
(1961,  Appendix  B)  suggest  that  such  a  comparison  should  not  be  regarded  as  decisive  unless 
the  difference  in  BIC  values  is  at  least  about  10. 

Model  comparisons  based  on  posterior  probabilities  can  yield  results  different  from  those 
based  on  significance  tests.  This  is  especially  so  with  large  samples,  including  some  that 
we  analyse  here.  In  such  cases  significance  tests  at  fixed  significance  levels  often  reject  null 
hypotheses  more  easily;  an  example  with  n  =  110,000  was  discussed  by  Raftery  (1986b). 
This  is  related  to  the  “conflict  between  significance  and  P  values”  discussed  by  Berger  and 
Sellke  (1987).  Alternatively,  basing  model  comparisons  on  Bayes  factors  may  be  viewed  as 
an  automatic,  decision-theoretic,  way  of  setting  significance  levels  so  as  to  balance  power 
and  significance. 

Our  code  produces  Pearson  residuals  which  can  be  used  to  suggest  ways  in  which  the 
model  could  be  improved.  New  models  suggested  by  such  a  process  can  be  compared  with 
the  other  models  under  consideration  also  using  approximate  Bayes  factors. 

3  Wind  direction  data 

Raftery  et  al.  (1982)  and  Haslett  and  Raftery  (1989)  describe  a  data  set  that  includes  hourly 
observations  of  wind  directions  at  a  meteorological  station  at  Roche’s  Point,  Ireland.  The 
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data  that  we  analyse  here  began  at  1  am  on  January  1,  1961  and  ran  far  almost  9  years. 
There  are  77,155  observations  in  all.  The  original  data  were  recorded  as  0  for  no  wind,  and 
then  in  10  degree  units  from  due  North,  for  a  total  of  37  states. 

One  aim  here  is  to  predict  wind  speeds  and  directions  so  as  to  control  the  wind  turbine 
generators  making  up  a  wind  farm,  and  to  manage  the  electric  power  supply.  Wind  turbines 
should  be  oriented  in  such  a  way  as  to  derive  the  most  energy  from  the  wind,  so  that  their 
current  best  orientation  is  a  function  of  future  as  well  as  current  wind  direction.  Predicting 
output  from  a  variable  energy  source  such  as  wind  is  important  so  that  the  need  for  power 
from  other,  more  stable,  sources  such  as  oil  can  be  anticipated. 

Figure  1  provides  a  histogram  of  the  distribution  of  wind  directions  for  all  nine  years 
combined,  together  with  separate  histograms  for  each  of  the  8  complete  years  in  the  data 
set.  Broadly  speaking,  these  annual  histograms  are  rather  similar  and  show  natural  modes  in 
the  data  that  are  preserved  from  year  to  year.  Based  on  these  results,  we  chose  to  recode  the 
data  into  five  categories:  0,  6-14,  15-23,  24-32,  and  33-5;  these  are  labelled  1  to  5  respectively 
in  what  follows. 


Insert  Figure  1  about  here 


As  might  hav*»  anticipate,  there  ar**  some  inhomogeneities  in  the  distribution  of 
wind  directions  which  are  revealed  when  the  data  are  analysed  in  separate  months.  See 
Figure  2. 

Insert  Figure  2  about  here 


These  distributions  are  rather  similar  for  the  months  of  November  through  April.  We 
therefore  chose  the  months  November  through  April  as  a  period  in  which  wind  directions 
might  be  modelled  by  a  stationary  MTD  model.  The  data  analyzed  below  come  from  the 
period  November  1961  to  April  1962,  providing  a  total  of  4344  consecutive  hourly  observa¬ 
tions. 

Insert  Table  1  about  here 


Table  1  gives  the  BIC  values  for  the  full  Markov  model,  and  the  MTD  model.  The  order 
is  estimated  to  be  7.  The  estimated  parameters  are  Aj  =  0.591,  A2  =  .237,  A3  =  .076,  A4  = 
.018,  A5  —  .024,  A6  =  .024,  A7  =  .031,  while 
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/  .65 

.01 

.01 

.01 

.01  \ 

.09 

.95 

.01 

.00 

.03 

.14 

.02 

.92 

.04 

.00 

.05 

.00 

.06 

.92 

.04 

k  .08 

.03 

.00 

.03 

.91  ) 

The  estimated  Aj’s  are  positive,  indicate  that  the  most  recent  observations  are  the  most 
important,  and  that  the  current  observation  tends  to  be  close  to  the  immediately  preceding 
ones,  as  we  would  expect.  The  Q  matrix  indicates  the  process  to  be  smooth,  with  the 
probability  of  staying  in  the  same  state  being  0.91  or  greater  whenever  there  is  any  wind, 
and  the  probability  of  the  direction  changing  by  more  than  one  state  being  very  small. 

We  note  that  A7  is  larger  than  A4,  A5  and  A6,  probably  due  to  the  fact  that  the  A7  term 
is  capturing  the  small  residual  dependence  on  Xt~%,  Xt- 9, . . .,  as  well  as  the  dependence  on 
Xt~7  itself.  This  suggested  that  we  fit  another  MTD(7)  model,  with  the  constraint  that 
A4  =  A5  =  A6  =  0.  As  we  discussed  in  section  2.2,  this  is  easy  to  do  using  our  algorithm. 

The  resulting  BIC  value  was  4530.4,  making  this  quite  clearly  the  best  model  considered. 
The  Q  matrix  was  almost  unchanged,  while  =  .598,  A2  =  .245,  A3  =  .100, A7  =  .057.  This 
is  not  very  different  from  the  full  MTD(7)  model,  but  it  seems  to  summarize  the  dependence 
in  a  more  parsimonious  way. 

4  The  analysis  of  intron  sequences 

The  area  of  biomolecular  sequence  comparison  has  provided  statisticians  with  a  wealth  of 
novel  problems.  As  an  example,  descriptive  statistics  on  DNA  composition  have  proved  useful 
in  the  search  for  coding  regions  and  introns,  the  statistical  assessment  of  sequence  similarity, 
and  the  analysis  of  repeated  motifs  that  may  be  of  biological  significance.  Similarly,  statistical 
analysis  of  protein  sequences  of  known  three-dimensional  structure  has  been  used  to  infer 
potential  folding  patterns  of  other  proteins.  It  is  not  our  purpose  here  to  describe  these 
areas  in  any  detail;  rather,  we  refer  the  reader  to  the  recent  review  article  by  Curnow  and 
Kirkwood  (1989)  and  the  books  edited  by  Waterman  (1988)  and  Doolittle  (1990)  for  good 
introductions  to  this  important  field.  In  this  section,  we  will  focus  on  just  one  example  from 
the  area  of  DNA  sequence  analysis  to  which  the  MTD  model  might  be  applied. 

The  statistical  significance  of  repeated  patterns  in  a  DNA  sequence  must,  of  course,  be 
measured  against  the  background  stochastic  structure  of  the  sequence  itself.  Among  possible 
models  for  this  structure  are  Markov  chains,  which  might  describe  the  DNA  sequence  in 
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terms  of  its  nucleotide  composition  (that  is,  as  a  string  of  letters  from  a  four-letter  alphabet, 
{ A ,  C,  G ,  T  }).  There  are  several  other  alphabets  of  biological  interest  such  as  the  purine- 
pyrimidine  alphabet  in  which  each  base  in  the  sequence  is  coded  as  either  purine  ({.4,  G }) 
or  pyrimidine  ({C,  T  }).  For  example,  Blaisdell  (1983)  reported  that,  relative  to  a  model  of 
independent  bases,  non-coding  sequences  (such  as  introns)  generally  contain  a  shortage  of 
runs  of  length  1  and  2  of  purines  and  pyrimidines,  and  an  excess  of  long  runs  of  them;  see 
also  Karlin,  Ost  and  Blaisdell  (1988).  In  this  section,  we  describe  an  exploratory  analysis  of 
two  different  DNA  sequences  from  introns  in  certain  mouse  genes. 

4.1  The  mouse  T-cell  receptor  a/6  locus 

The  first  example  is  an  analysis  of  part  of  the  mouse  T-cell  receptor  a/ 6  locus  (Wilson  et  at., 
1991;  Koop  et  at .,  11991).  This  region  is  94,647  bases  in  length.  It  comprises  over  50  introns 
and  50  exons;  the  exons  comprise  just  6%  of  the  sequence.  The  particular  sequence  we  have 
analyzed  is  the  intron  prior  to  joining  gene  segment  J50  (Koop  et  at.,  1991).  It  starts  5'  to 
exon  1  of  V65,  and  ends  three  bases  before  the  recombination  signal  5'  to  J50.  The  sequence 
is  5,778  bases  in  length. 

A  preliminary  analysis  shows  that  the  sequence  is  clearly  first-order  when  analyzed  in  the 
four-letter  alphabet  {A,G,C, T};  see  Table  2.  The  MTD  model  provides  no  improvement 
on  this  fit.  The  estimated  transition  matrix  Q  is 


last 

base 

A 

G 

c 

T 

A 

.31 

.29 

.32 

.18 

next 

G 

.27 

.27 

.04 

.29 

base 

C 

.20 

.22 

.27 

.26 

T 

.22 

.21 

.37 

.27 

Insert  Table  2  about  here 

Recall  that  a  Markov  chain  with  transition  matrix  Q  is  (strongly)  lumpable  with  respect 
to  a  partition  Pi, . . . ,  Pr  of  the  state  space  if,  and  only  if,  for  1  <  i,j  <  r,J2ieP}  qik  has  the 
same  value  for  each  h  €  P,  (Kemeny  and  Snell,  1960).  The  lumpability  condition  ensures 
that  the  lumped  process  is  also  Markovian,  no  matter  what  the  initial  distribution  of  the 
original  process  might  be.  An  examination  of  the  matrix  Q  above  indicates  that  we  might 
simplify  the  stochastic  description  of  this  intron  by  lumping  the  states  A  and  G  into  a  single 
state  A/G  that  denotes  purine.  A  formal  test  of  this  lumpability  hypothesis  may  be  found 
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in  Thomas  and  Barr  (1977).  The  new  alphabet  is  {A/G,  C,  T).  The  BIC  analysis  of  this 
new  sequence  is  presented  in  Table  2.  As  would  be  expected,  among  the  fully-parametrized 
Markov  models  a  first  order  chain  provides  the  most  parsimonious  description.  Its  estimated 
transition  matrix  is 

last  base 
A/G  C  T 
next  A/G  .57  .36  .47 

base  G  .21  .27  .26 

C  .22  .37  .27 

However,  it  can  be  seen  from  Table  2  that  a  second  order  MTD  model  provides  a  better 
description.  In  this  case,  Ai  =  0.71,  A2  =  0.29,  and  the  estimated  transition  matrix  Q  is 

last  base 


A/G 

C 

T 

next 

A/G 

.60 

.33 

.44 

base 

G 

.20 

.28 

.26 

C 

.20 

.39 

.29 

Finally,  we  analyze  the  purine-pyrimidine  alphabet  {A/G,C/T}.  From  the  previous  dis¬ 
cussion,  it  seems  clear  that  the  original  sequence  is  not  lumpable  with  respect  to  the  purine- 
pyrimidine  partition  of  the  states.  The  purine-pyrimidine  sequence  may  not  be  Markovian  (or 
even  homogeneous,  unless  we  assume  that  the  original  chain  was  stationary).  Fitting  Marko¬ 
vian  models  to  such  data  provides  an  exploratory  approach  to  approximating  the  stochastic 
structure  of  a  complicated  process  by  simpler  ones.  One  might  expect  this  more  complicated 
structure  to  be  reflected  in  a  higher  estimated  order  of  dependence  in  the  Markovian  approx¬ 
imation.  This  is  indeed  the  case  here,  as  the  results  in  Table  2  verify.  The  purine-pyrimidine 
chain  is  approximated  by  a  second-order  Markov  chain.  The  MTD  model  offers  no  further 
improvement  in  this  case. 

The  discussion  of  lumpability  provides  an  indication  of  the  greatest  extent  to  which  the 
states  of  the  chain  can  be  aggregated  without  losing  important  structure.  Here,  this  seems 
to  be  the  three-state  case  analyzed  above.  Thus,  in  a  sense,  the  second-order  MTD  model 
for  the  three-state  case  provides  the  most  parsimonious  available  representation  of  the  data 
within  the  class  of  Markov  chain  models  discussed. 

4.2  The  mouse  aA-crystallin  gene 

Avery  (1987)  examined  the  Markovian  structure  of  introns  from  several  other  genes  in  mouse, 
in  order  to  determine  whether  certain  short  DNA  sequences  occurred  more  often  than  would 
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be  expected  by  chance.  Here  we  will  analyse  the  introns  from  the  mouse  aA-crystallin 
gene,  further  details  of  which  may  be  found  in  Avery’s  paper.  The  sequence  analyzed  here 
comprises  two  introns,  of  total  length  1307  bases. 

Following  the  style  of  analysis  of  the  previous  example,  we  see  first  that  the  sequence 
of  bases  with  xlphabet  {A,  C,  G,T}  is  clearly  indicated  to  be  of  order  1;  see  Table  3.  The 
estimated  transition  matrix  Q  is  given  by 


last 

base 

A 

G 

C 

T 

A 

.23 

.23 

.30 

.19 

next 

G 

.34 

.32 

.06 

.30 

base 

C 

.25 

.27 

.34 

.28 

T 

.18 

.19 

.30 

.23 

Insert  Table  3  about  here 


Notice  that  this  transition  matrix  is  qualitatively  rather  similar  to  the  corresponding 
matrix  for  the  T-cell  receptor  intron  discussed  in  the  previous  section.  In  particular,  this 
sequence  is  also  (approximately)  lumpable  with  respect  to  the  partition  {A/G,  C,  T}.  The 
results  are  again  consistent  with  the  previous  example,  in  that  among  the  fully-parametrized 
Markov  models,  the  first  order  model  provides  the  best  description.  However,  a  more  parsi¬ 
monious  description  is  provided  by  an  MTD(2)  model  in  which  Aj  =  2.46,  A2  =  —1.46,  and 
the  estimated  transition  matrix  Q  is 


last  base 


A/G 

C 

T 

next 

A/G 

.52 

.43 

.49 

base 

G 

.27 

.32 

.29 

C 

.21 

.25 

.22 

Note  that  in  this  example  the  likelihood  is  maximized  by  some  negative  values  of  the  A,. 
The  constraints  are  maintained  by  using  the  method  outlined  in  Proposition  1. 

Finally,  the  analysis  of  the  collapsed  chain  in  its  purine-pyrimidine  alphabet  is  given  in 
Table  3.  The  odds  for  the  data  being  generated  by  a  second-order  MTD  model  as  against 
a  first-order  Markov  chain  are  about  2:1,  by  (11).  This  does  provide  some  evidence  for  the 
chain  being  of  order  two,  although  in  the  words  of  Jeffreys  (1961)  it  is  “not  worth  more 
than  a  bare  mention”.  (The  standard  likelihood  ratio  test  statistic  is  8.7  with  one  degree  of 
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freedom,  and  the  corresponding  P-value  from  the  asymptotic  \2  distribution  is  about  0.003. 
Thus  the  approximate  Bayes  factor  and  the  approximate  P-value  point  in  the  same  direction 
but,  as  usual,  the  P-value  suggests  stronger  evidence  for  the  larger  model.)  The  parameter 
estimates  from  the  GLIM  algorithm  described  in  section  2.4  are  identical,  and  the  minimum 
X2  estimates  are  essentially  the  same. 

In  this  example,  the  structure  of  the  purine-pyrimidine  sequence  is  captured  by  the  MTD 
model,  rather  than  by  the  fully  parametrized  Markov  model  that  was  required  for  the  earlier 
intron  sequence.  The  parameters  are  Ai  =  2.19,  A2  =  —1.19  and  the  estimated  transition 
matrix  Q  is 

last  base 
A/G  C/T 
next  A/G  .52  .45 

base  G  .48  .55 


4.3  Comments 

In  these  examples  the  emphasis  is  on  model  fitting  to  find  (or  approximate)  structure,  rather 
than  for  prediction.  We  have  seen  that  the  MTD(2)  model  provides  a  good  description  of 
the  two  intron  sequences  when  they  are  coded  in  the  {A/G,  C,  T}  alphabet.  Some  other 
sequence  analysis  examples  in  which  the  MTD  model  has  been  applied  appear  in  Tavare 
and  Giddings  (1988).  While  Markov  models  are  a  useful  first  step  in  this  context,  their 
validity  is  often  questionable  because  of  possible  inhomogeneities  in  the  sequence.  This 
inhomogeneity  is  particularly  pronounced  in  coding  regions  (exons),  where  it  is  well  known 
that  the  three  codon  positions  exhibit  markedly  different  behavior.  To  analyse  such  regions, 
more  sophisticated  non-homogeneous  Markov  models  may  be  required.  Some  of  these  are 
described,  for  example,  by  Tavare  and  Song  (1989)  and  Watterson  (1991). 

5  Discussion 

Various  generalisations  of  the  MTD  model  have  been  proposed.  Raftery  ( 1985a, b)  proposed 
ways  of  modelling  the  case  where  m  —  00,  such  as  when  the  observations  are  counts.  Adke 
and  Deshmukh  (1988)  showed  that  asymptotic  properties  valid  when  m  is  finite  also  apply 
when  m  =  00.  It  seems  that  our  estimation  method  will  work  in  that  case  also,  provided 
that  the  (now  doubly  infinite)  matrix  Q  is  modelled  parametrically.  If  m  =  cc  and 

liminf  g(i|j)  =  0  Vj,  (12) 
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then  the  constraints  (5)  are  equivaient  to  the  positivity  assumption  (6),  and  the  computa¬ 
tional  problem  is  greatly  simplified. 

Mehran  (1989)  considered  the  infinite  lag  MTD  model,  l  =  oo,  where  A j  is  a  parametric 
function  of  j.  Our  method  seems  applicable  in  this  case  also,  although  calculating  the 
likelihood,  or  the  fitted  values  for  minimum  x 2  estimation,  seems  difficult.  It  may  be  possible 
to  model  discrete-valued  time  series  with  the  long-memory  property  using  this  approach,  by 
setting  the  A  j  equal  to  the  7r-weights  for  the  fractionally-differenced  ARIMA  (p,  d,  q)  process. 
Various  continuous-valued  environmental  time  series  such  as  wind  speeds  are  of  this  kind 
(Haslett  and  Raftery,  1989),  and  it  seems  reasonable  to  suppose  that  some  discrete-valued 
time  series  might  have  this  property  also. 

Martin  and  Raftery  (1987)  and  Adke  and  Deshmukh  (1988)  pointed  out  that  the  MTD 
model  remains  well-defined  for  arbitrary  state  spaces,  which  need  not  be  finite,  countable  or 
even  discrete.  Equations  (1)  and  (2)  remain  valid  if  p  and  q  are  interpreted  as  conditional 
densities,  where  q  will  usually  have  some  parametric  form.  Le,  Martin  and  Raftery  (1990) 
have  shown  that  this  provides  a  framework  for  modelling  bursts,  outliers  and  flat  stretches  in 
continuous- valued  time  series,  and  also  models  time  series  that  are  well  fitted  by  conventional 
Gaussian  ARM  A  models.  If  condition  (12)  holds,  then  so  does  the  positivity  assumption  (6). 
However,  this  is  not  always  the  case,  even  when  the  state  space  is  continuous.  For  example, 
in  continuous-valued  directional  time  series,  (12)  does  not  necessarily  hold.  Craig  (1989)  has 
investigated  MTD  and  other  models  for  this  situation,  and  has  studied  the  consequences  of 
(12)  not  holding.  Breckling  (1989)  has  also  studied  such  time  series. 
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Caption  for  Figure  1 


Histograms  of  wind  directions  in  Roche’s  Point,  Ireland,  1961-1969. 

Analysis  by  year. 

Wind  directions  are  measured  in  units  of  10°  from  North.  0  indicates  no  wind. 


Caption  for  Figure  2 

Histograms  of  wind  directions  in  Roche’s  Point,  Ireland,  1961-1969. 

Analysis  by  month. 

Wind  directions  are  measured  in  units  of  10°  from  North.  0  indicates  no  wind. 


963 


Table  1:  Wind  direction  data  from  Roche’s  Point,  Ireland1 


Order,  / 

Number  of 
parameters,  k 

BIC 

(full  model) 

Number  of 
parameters,  k 

BIC 

(MTD  model) 

0 

4 

12,716.5 

1 

20 

5,085.7 

20 

5,085.7 

2 

100 

5,198.8 

21 

4,646.4 

3 

500 

8,243.3 

22 

4,569.5 

4 

2,500 

24,674.7 

23 

4,557.5 

5 

- 

- 

24 

4,544.' 

6 

- 

- 

25 

4,539.8 

7 

- 

- 

26 

4,538.8 

8 

- 

- 

27 

4,540.8 

9 

- 

- 

28 

4,540.4 

10 

“ 

“ 

29 

4,545.4 

n  =  4, 325  observations  starting  at  position  20  in  the  sequence. 


Table  2:  Intron  from  mouse  T-cell  receptor  6 /a  locus1. 


Order,  l 

Number  of 

BIC 

Number  of 

BIC 

parameters,  k 

(full  model) 

parameters,  k 

(MTD  modJ) 

_ 1 

Alphabet: 

A,C,G,T 

0 

3 

15,980.1 

1 

12 

15,549.3 

2 

48 

15,756.9 

13 

3 

192 

16,828.7 

14 

Alphabet: 

A/G,  C,T 

0 

2 

12,042.0 

1 

6 

11,900.1 

2 

18 

11,939.7 

7 

11,885.5 

3 

54 

12,204.7 

8 

11,890.2 

Alphabet: 

A/G,  C/T 

■■ 

0 

2 

8,005.6 

1 

4 

7,881.8 

2 

8 

7,850.5 

3 

3 

16 

7,878.2 

4 

1 


n  =  5,369  bases,  starting  at  position  10  in  the  sequence. 


Table  3:  Introns  from  Mouse  aA-crystallin  gene1 


Order,  / 

Number  of 

BIC 

Number  of 

BIC 

parameters,  k 

(full  model) 

parameters,  k 

(MTD  model) 

Alphabet: 

A,C,G,T 

i 

1 

0 

3 

3,620.8 

1 

12 

3,559.7 

1 

2 

48 

3,758.8 

13 

3,566.1 

3 

192 

4,542.8 

14 

3,572.8 

Alphabet: 

A/G,C,T 

0 

2 

2,739.0 

1 

6 

2,728.7 

2 

18 

2,786.6 

7 

2,722.7 

3 

54 

2,973.2 

8 

2,729.4 

Alphabet: 

A/G,C/T 

0 

1 

1,810.9 

1 

2 

1,792.8 

2 

4 

1,798.1 

3 

1,791.3 

3 

8 

1,813.8 

4 

1,797.1 

1  Two  introns,  n=1302  bases,  starting  at  position  6  in  the  sequence. 
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The  mixture  transition  distribution  (MTD)  model  was  introduced  by  Raftery  (1985' 
as  a  parsimonious  model  for  high-order  Markov  chains.  It  is  flexible,  can  represent  a 
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appears  to  be  a  discrete-valued  analogue  for  the  class  of  autoregressive  time  series 
models.  However,  estimation  has  presented  difficulties  because  the  parameter  space  is 
highly  nonconvex,  being  defined  by  a  large  number  of  nonlinear  constraints. 

Here  we  propose  an  efficient  computational  algorithm  for  maximum  likelihood  esti¬ 
mation  which  is  based  on  a  way  of  reducing  the  large  number  of  constraints.  This  also 
allows  more  structured  versions  of  the  model,  for  example  those  involving  structural 
zeros,  to  be  fit  quite  easily.  A  way  of  fitting  the  model  using  GLIM  is  also  discussed. 

The  algorithm  is  applied  to  a  sequence  of  wind  directions,  and  also  to  two  sequences 
of  DNA  bases  from  introns  from  genes  of  the  mouse.  In  each  case,  the  MTD  model  fits 
better  than  the  conventional  Markov  chain  model. 


